- If you lost points for the dashboard, show Ethan your .Rmd document to get points back
- Treat the projects like a formal report
- Proofread! Make sure plots render.
- Make your code/plots easy to read! Include comments and labels.
Remember that once we have a model of the data generating process, we can pretend that our model *is* the data generating process and simulate data from it.
# Load packages.
library(tidyverse)
library(tidymodels)

# What's this again?!
set.seed(42)
# Set variable and parameter values.
nobs <- 300
intercept <- 30
slope <- 9

# Simulate data.
sim_data_01 <- tibble(
  discount = round(runif(nobs, min = 0, max = 20)),
  sales = intercept + slope * discount + rnorm(nobs, mean = 0, sd = 5)
)

sim_data_01
## # A tibble: 300 × 2
##    discount sales
##       <dbl> <dbl>
##  1       18 192. 
##  2       19 193. 
##  3        6  89.8
##  4       17 182. 
##  5       13 145. 
##  6       10 114. 
##  7       15 165. 
##  8        3  53.0
##  9       13 144. 
## 10       14 162. 
## # ℹ 290 more rows
sim_data_01 |> 
  ggplot(aes(x = discount, y = sales)) +
  geom_jitter(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm")
We can use the binomial distribution to simulate binary data. A binary variable has two levels; the level coded as zero is known as the baseline level or reference level.
# Simulate some more data.
sim_data_02 <- tibble(
  coupon = rbinom(nobs, size = 1, prob = 0.7),
  sales = intercept + slope * coupon + rnorm(nobs, mean = 0, sd = 5)
)

sim_data_02
## # A tibble: 300 × 2
##    coupon sales
##     <int> <dbl>
##  1      1  41.1
##  2      1  43.9
##  3      1  43.2
##  4      0  26.7
##  5      1  46.8
##  6      1  30.9
##  7      1  43.3
##  8      1  36.4
##  9      1  29.4
## 10      1  29.7
## # ℹ 290 more rows
sim_data_02 |> 
  ggplot(aes(x = coupon, y = sales)) +
  geom_jitter(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm")
Remember that when we fit a linear model we are finding the line of best fit and getting parameter estimates.
# Fit the first model.
fit_01 <- linear_reg() |> 
  set_engine("lm") |> 
  fit(sales ~ discount, data = sim_data_01)

# Fit the second model.
fit_02 <- linear_reg() |> 
  set_engine("lm") |> 
  fit(sales ~ coupon, data = sim_data_02)
Remember that our goal is to use the model to estimate the unobserved parameters from the data.
# Evaluate model fit.
fit_01 |> 
  tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    29.8     0.537       55.5 2.92e-159
## 2 discount        9.01    0.0466     193.  2.46e-315
\[sales = 29.80 + 9.01 \times discount\]
If \(x\) is continuous: the slope is the expected change in \(y\) for a one-unit increase in \(x\). Here, each additional unit of discount is associated with an increase of about 9.01 in sales.
What’s different again about sim_data_02?
# Evaluate model fit.
fit_02 |> 
  tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    29.6     0.518       57.2 1.00e-162
## 2 coupon          9.07    0.617       14.7 3.13e- 37
\[sales = 29.60 + 9.07 \times coupon\]
If \(x\) is discrete: the slope is the expected difference in \(y\) between that level and the baseline. Here, sales are about 9.07 higher, on average, when a coupon is used.
So far the parameter estimates have been point estimates, a single number that represents our best guess.
But this is a statistical model – there is always uncertainty (i.e., error). We can produce an interval estimate of the parameters, a range of numbers that represent our best guess.
# Include confidence intervals.
tidy(fit_01, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.8     0.537       55.5 2.92e-159    28.7      30.9
## 2 discount        9.01    0.0466     193.  2.46e-315     8.92      9.10
These interval estimates are called confidence intervals. Given our model and the data, they are our best guess of the unobserved parameters.
If the confidence interval doesn’t include zero, we can say that the parameter estimate is statistically significant (a.k.a., significant or significantly different from zero).
We reach the same conclusion using a confidence interval as when using a p-value!
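A quick way to see this agreement is to check directly whether zero falls inside the interval. Here's a minimal base-R sketch (outside the tidymodels workflow used above, with its own freshly simulated data so it is self-contained):

```r
# The 95% confidence interval excludes zero exactly when p < 0.05.
set.seed(42)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)

fit <- lm(y ~ x)
ci <- confint(fit)["x", ]                    # 95% CI for the slope
p  <- summary(fit)$coefficients["x", 4]      # slope p-value

# CI excludes zero if the lower bound is above 0 or the upper bound is below 0.
excludes_zero <- unname(ci[1] > 0 | ci[2] < 0)
excludes_zero == (p < 0.05)                  # the two criteria agree
```

This equivalence holds because both the interval and the p-value are built from the same estimate, standard error, and significance level.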
# Include confidence intervals.
tidy(fit_02, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.6     0.518       57.2 1.00e-162    28.6      30.7
## 2 coupon          9.07    0.617       14.7 3.13e- 37     7.86     10.3
We’ll examine the properties of point estimates and confidence intervals using simulation.
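As a preview, here's a minimal base-R sketch of what such a simulation might look like: repeatedly simulate data from a known slope, refit the model, and count how often the 95% confidence interval captures the true value (the names true_slope and n_sims are my own, not from the lecture):

```r
# Check CI coverage: across many simulated datasets, roughly 95% of
# 95% confidence intervals should contain the true slope.
set.seed(42)
true_slope <- 9
n_sims <- 500

covered <- replicate(n_sims, {
  discount <- round(runif(300, min = 0, max = 20))
  sales <- 30 + true_slope * discount + rnorm(300, mean = 0, sd = 5)
  ci <- confint(lm(sales ~ discount))["discount", ]
  ci[1] <= true_slope & true_slope <= ci[2]
})

mean(covered)  # should be close to 0.95
```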
Eventually we’ll want to compare how well different models fit the same data. To do that efficiently we need a single number that describes the overall model fit.
# Look at the r.squared.
glance(fit_01)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.992         0.992  4.73    37436. 2.46e-315     1  -891. 1787. 1798.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The \(R^2\) is the proportion of variation in \(y\) that can be explained by the model (i.e., by the explanatory variable(s)). Closer to 1 is better! This measure is easy to misuse, so it will serve as our measure of overall model fit only temporarily.
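To make that definition concrete, \(R^2\) can be computed by hand as one minus the ratio of residual variation to total variation. A base-R sketch, with its own simulated data:

```r
# R-squared by hand: 1 - SS_residual / SS_total.
set.seed(42)
x <- runif(200, min = 0, max = 20)
y <- 30 + 9 * x + rnorm(200, mean = 0, sd = 5)

fit <- lm(y ~ x)
ss_res <- sum(residuals(fit)^2)      # variation left over after the model
ss_tot <- sum((y - mean(y))^2)       # total variation in y

r2 <- 1 - ss_res / ss_tot
all.equal(r2, summary(fit)$r.squared)
```

For a linear model with an intercept this hand computation matches the r.squared reported by summary() and glance() exactly.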
While the parameter estimates are often the object of interest for an inferential model, we of course can predict the outcome using the fitted model:
\[sales = 29.60 + 9.07 \times coupon\]
To predict the outcome we need new data representing the counterfactual scenarios we'd like to predict, which we then feed into the fitted model.
# Column names need to match the fitted model.
scenarios <- tibble(coupon = c(0, 1))
We can predict using the fitted model and the new data.
# Predict and bind on the new data.
predict(fit_02, new_data = scenarios) |> 
  bind_cols(scenarios)
## # A tibble: 2 × 2
##   .pred coupon
##   <dbl>  <dbl>
## 1  29.6      0
## 2  38.7      1
Remember that we have more than point estimates. We can compute interval estimates for predictions as well, called prediction intervals.
# Predict and bind on prediction intervals.
bind_cols(
  predict(fit_02, new_data = scenarios),
  predict(fit_02, new_data = scenarios, type = "pred_int"),
  scenarios
)
## # A tibble: 2 × 4
##   .pred .pred_lower .pred_upper coupon
##   <dbl>       <dbl>       <dbl>  <dbl>
## 1  29.6        20.0        39.3      0
## 2  38.7        29.1        48.3      1
When we work with real data, it can be hard to tell the difference between doing something wrong with the model and doing something wrong with the code.
Another reason to simulate data is to confirm that our code is working: we should be able to recover the parameter values we used to simulate the data.
tidy(fit_01, conf.int = TRUE)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.8     0.537       55.5 2.92e-159    28.7      30.9
## 2 discount        9.01    0.0466     193.  2.46e-315     8.92      9.10
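One way to make this recovery check explicit in code is to test that each true parameter value falls inside its confidence interval. A base-R sketch that re-simulates the same data generating process as sim_data_01:

```r
# Recover known parameters: the true values used to simulate the data
# should sit inside the fitted model's 95% confidence intervals.
set.seed(42)
intercept <- 30
slope <- 9
discount <- round(runif(300, min = 0, max = 20))
sales <- intercept + slope * discount + rnorm(300, mean = 0, sd = 5)

ci <- confint(lm(sales ~ discount))

intercept_recovered <- ci["(Intercept)", 1] <= intercept & intercept <= ci["(Intercept)", 2]
slope_recovered     <- ci["discount", 1] <= slope & slope <= ci["discount", 2]
c(intercept_recovered, slope_recovered)
```

If either check fails far more than 5% of the time across repeated simulations, the bug is in our code, not in the model.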
Summary
Next Time
Supplementary Material
Artwork by @allison_horst
- Simulate discount using a uniform distribution, coupon as a binary variable using the binomial distribution, and sales as a function of only discount, using a linear model.
- Fit sales ~ discount and sales ~ discount + coupon.